Niels Richard Hansen
16/02-2016
The lasso acronym means
least absolute shrinkage and selection operator
and was introduced by Robert Tibshirani in
Regression Shrinkage and Selection via the Lasso
published in the first issue of Journal of the Royal Statistical Society: Series B, 1996.
You say las-SOO
and I say LAs-so.
From Spanish lazo.
We consider linear regression models of a response \( Y \) on a vector \( X = (X_1, \ldots, X_p) \) of predictors: \[ E(Y) = \mathbf{X}^T \beta= \sum_{j=1}^p X_j \beta_j. \]
With \( n \) observations we organize \( Y_1, \ldots, Y_n \) in a vector \( \mathbf{Y} \) and \( X_1, \ldots, X_n \) in an \( n \times p \) matrix \( \mathbf{X} \).
The least squares estimator is \[ \hat{\beta} = \mathop{\arg \min}\limits_{\beta} \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 = \mathop{\arg \min}\limits_{\beta} \sum_{i=1}^n (Y_i - X_i^T \beta)^2, \] where \( \|\mathbf{z}\|_2^2 = \sum_{i=1}^n z_i^2 \) is the squared Euclidean norm.
The response, \( Y \), will be lpsa. The remaining 8 variables will be the predictors in \( X \).
     lpsa  lcavol lweight age   lbph svi    lcp gleason pgg45
1 -0.4308 -0.5798   2.769  50 -1.386   0 -1.386       6     0
2 -0.1625 -0.9943   3.320  58 -1.386   0 -1.386       6     0
3 -0.1625 -0.5108   2.691  74 -1.386   0 -1.386       7    20
4 -0.1625 -1.2040   3.283  58 -1.386   0 -1.386       6     0
All variables are standardized upfront.
prostate <- scale(prostate, TRUE, TRUE)
## Intercept removed due to standardization
prostateLm <- lm(lpsa ~ . - 1,
data = as.data.frame(prostate))
summary(prostateLm)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
lcavol 0.5762 0.0892 6.46 5.4e-09 ***
lweight 0.2309 0.0741 3.11 0.0025 **
age -0.1370 0.0711 -1.93 0.0571 .
lbph 0.1216 0.0724 1.68 0.0966 .
svi 0.2732 0.0860 3.18 0.0021 **
lcp -0.1285 0.1082 -1.19 0.2385
gleason 0.0308 0.0966 0.32 0.7507
pgg45 0.1089 0.1061 1.03 0.3072
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.603 on 89 degrees of freedom
Multiple R-squared: 0.663, Adjusted R-squared: 0.633
F-statistic: 21.9 on 8 and 89 DF, p-value: <2e-16
prostateLasso <- glmnet(x = prostate, y = prostate[, "lpsa"],
exclude = 1, intercept = FALSE)
plot(prostateLasso, label = TRUE, lwd = 2)
Lasso is a regression technique that gives a family of coefficient estimates \( ({}^{s \!}\hat{\beta})_{s \geq 0} \) indexed by a tuning parameter \( s \geq 0 \).
\[ {}^{1 \!}\hat{\beta} = \mathop{\arg \min}\limits_{\beta: \|\beta\|_1 \leq 1} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2. \]
where \[ \|\beta\|_1 = \sum_{j=1}^p |\beta_j| \] denotes the \( \ell_1 \)-norm.
\[ {}^{s \!}\hat{\beta} = \mathop{\arg \min}\limits_{\beta: \|\beta\|_1 \leq s} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2. \]
In penalized form, lasso likewise defines a family of estimates, indexed by \( \lambda \geq 0 \).
\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \| \mathbf{Y} - \mathbf{X} \beta \|^2_2 + \lambda \|\beta\|_1. \]
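The penalized criterion can be written directly as an R function. A minimal sketch in base R; the function name `lassoObjective` is illustrative, not from any package:

```r
## Penalized lasso criterion from the display above:
##   ||Y - X beta||_2^2 + lambda * ||beta||_1
lassoObjective <- function(beta, X, Y, lambda) {
  sum((Y - X %*% beta)^2) + lambda * sum(abs(beta))
}

## Tiny check with an identity design: residual c(0, 1), penalty 2
lassoObjective(c(1, 1), diag(2), c(1, 2), lambda = 1)  ## 3
```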
The monotonically decreasing, data dependent map
\[ s(\lambda) = \|\beta^{\lambda}\|_1 \]
from \( (0,\infty) \) to \( (0, s_{\mathrm{max}}) \) gives the relation
\[ \beta^{\lambda} = {}^{s(\lambda) \!}\hat{\beta} \]
between the constrained lasso and the penalized lasso.
If \( \mathbf{X} \) is orthogonal, i.e. \( \mathbf{X}^T \mathbf{X} = \mathbf{I}_p \), then \[ \beta_i^{\lambda} = \mathrm{sign}(\hat{\beta}_i) \max\{|\hat{\beta}_i| - \lambda, 0\} \] is soft thresholding of the least squares estimator.
This can be compared to hard thresholding \[ \hat{\beta}_i 1(|\hat{\beta}_i| > \lambda) \]
and linear shrinkage \[ \frac{\hat{\beta}_i}{1 + \lambda} \] related to ridge regression.
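The three shrinkage rules above are easy to compare numerically. A small base R sketch with illustrative function names (not from any package), using the formulas as stated:

```r
## Soft thresholding, hard thresholding and linear (ridge-type)
## shrinkage of a least squares estimate, as in the formulas above.
soft  <- function(beta, lambda) sign(beta) * pmax(abs(beta) - lambda, 0)
hard  <- function(beta, lambda) beta * (abs(beta) > lambda)
ridge <- function(beta, lambda) beta / (1 + lambda)

betaHat <- c(-2, -0.5, 0.1, 0.8, 3)
soft(betaHat, 1)   ## -1.00  0.00 0.00 0.0 2.0
hard(betaHat, 1)   ## -2.00  0.00 0.00 0.0 3.0
ridge(betaHat, 1)  ## -1.00 -0.25 0.05 0.4 1.5
```

Note that soft thresholding both selects (sets small coefficients to zero) and shrinks the surviving coefficients, whereas hard thresholding only selects and linear shrinkage only shrinks.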
Lasso is, for fixed \( s \), the solution of a quadratic optimization problem with \( 2^p \) linear inequality constraints.
Tibshirani proposed an iterative algorithm for adding active constraints. It requires the sequential solution of quadratic optimization problems with linear inequality constraints.
Tibshirani implemented public domain functions for S-PLUS.
“To obtain them, use file transfer protocol to lib.stat.cmu.edu and retrieve the file S/lasso, or send an electronic mail message to statlib@lib.stat.cmu.edu with the message send lasso from S.”
Homotopy methods as in On the LASSO and its Dual, Osborne, Presnell and Turlach, 2000, JCGS, compute the entire solution path \[ \lambda \mapsto \beta^{\lambda}. \]
Least angle regression (LAR) and its lasso modification (LARS) were developed by Efron et al. and give a fast homotopy algorithm.
prostateLars <- lars(x = prostate[, -1], y = prostate[, "lpsa"],
intercept = FALSE, normalize = FALSE)
This is a pure R implementation!
plot(prostateLars)
coefficients(prostateLars)
lcavol lweight age lbph svi lcp gleason pgg45
[1,] 0.000 0.000 0.0000 0.0000 0.000 0.000 0.0000 0.00000
[2,] 0.365 0.000 0.0000 0.0000 0.000 0.000 0.0000 0.00000
[3,] 0.400 0.000 0.0000 0.0000 0.035 0.000 0.0000 0.00000
[4,] 0.483 0.149 0.0000 0.0000 0.158 0.000 0.0000 0.00000
[5,] 0.487 0.161 0.0000 0.0000 0.166 0.000 0.0000 0.00917
[6,] 0.505 0.182 0.0000 0.0443 0.199 0.000 0.0000 0.03388
[7,] 0.517 0.202 -0.0519 0.0763 0.211 0.000 0.0000 0.05595
[8,] 0.522 0.213 -0.0812 0.0937 0.219 0.000 0.0107 0.06064
[9,] 0.576 0.231 -0.1370 0.1216 0.273 -0.128 0.0308 0.10891
The coordinate descent algorithm solves the lasso optimization problem very efficiently.
1) If you don't get the joke look up the fast Fourier transform.
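The coordinate-wise updates have a closed form given by the soft thresholding rule above. A didactic base R sketch of cyclic coordinate descent for the penalized criterion \( \|\mathbf{Y} - \mathbf{X}\beta\|_2^2 + \lambda \|\beta\|_1 \); this is not glmnet's actual implementation (which adds warm starts, screening rules and covariance updates), and the function names are illustrative:

```r
softThreshold <- function(z, t) sign(z) * pmax(abs(z) - t, 0)

## Cyclic coordinate descent for ||Y - X beta||_2^2 + lambda ||beta||_1.
lassoCD <- function(X, Y, lambda, maxit = 10000, tol = 1e-10) {
  p <- ncol(X)
  beta <- numeric(p)
  xs <- colSums(X^2)
  for (it in seq_len(maxit)) {
    betaOld <- beta
    for (j in seq_len(p)) {
      ## Partial residual excluding coordinate j:
      r <- Y - X[, -j, drop = FALSE] %*% beta[-j]
      ## Minimizing in beta_j alone gives soft thresholding at lambda / 2:
      beta[j] <- softThreshold(sum(X[, j] * r), lambda / 2) / xs[j]
    }
    if (max(abs(beta - betaOld)) < tol) break
  }
  beta
}

## With lambda = 0 the solution is least squares; for large lambda all
## coefficients are zero. Simulated illustration:
set.seed(1)
X <- scale(matrix(rnorm(200), 50, 4))
Y <- X %*% c(2, 0, -1, 0) + rnorm(50)
lassoCD(X, Y, lambda = 0)
```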
From P. Bühlmann's discussion of R. Tibshirani, Regression shrinkage and selection via the lasso: a retrospective, JRSSB, 2011.
From Vincent et al., Modeling tissue contamination to improve molecular identification of the primary tumor site of metastases, Bioinformatics, 2013.
The squared error loss is replaced by the negative log-likelihood:
\[ \beta^{\lambda} = \mathop{\arg \min}\limits_{\beta} \ell(\beta) + \lambda \|\beta\|_1. \]
With \( K \) groups and \( p \) predictors there will be \( Kp \) parameters.
In the example we have \( K = 9 \) and \( p = 384 \) giving 3456 parameters.
load("miRNA.RData")
miRNAlasso <- glmnet(Xprim, Yprim, family = "multinomial")
plot(miRNAlasso)
miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial",
type.measure = "class")
plot(miRNAlasso, ylim = c(0, 0.4))
Ordinary lasso penalty
Group lasso penalty
Sparse group lasso penalty
The (default) sparse group lasso penalty for multinomial lasso is defined as \[ \begin{aligned} \mathrm{Pen}(\beta) & = \alpha \sum_{k,j} |\beta_{kj}| + (1-\alpha) \sum_{k=1}^M \gamma_k \sqrt{\sum_{j=1}^p \beta_{kj}^2} \\ & = \alpha \|\beta\|_1 + (1-\alpha) \sum_{k=1}^M \gamma_k \|\beta_{k\cdot}\|_2 \end{aligned} \] for \( \alpha \in [0,1] \). The value of \( \alpha = 1 \) gives lasso and \( \alpha = 0 \) gives group lasso.
The group lasso gives parameters associated with a predictor that are either all 0 or all different from 0. The sparse group lasso encourages a group selection, but allows for selection of only some parameters associated with a predictor.
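The penalty is straightforward to evaluate for a coefficient matrix. A base R sketch with one group per row of the matrix, as in the formula above; the weight choice \( \gamma_k = \sqrt{p} \) (square root of the group size) is a common default but an assumption here, and the function name is illustrative:

```r
## Sparse group lasso penalty
##   alpha * sum |beta_kj| + (1 - alpha) * sum_k gamma_k ||beta_k.||_2
## with one group per row of beta and gamma_k = sqrt(ncol(beta)).
sglPenalty <- function(beta, alpha,
                       gamma = rep(sqrt(ncol(beta)), nrow(beta))) {
  alpha * sum(abs(beta)) +
    (1 - alpha) * sum(gamma * sqrt(rowSums(beta^2)))
}

beta <- rbind(c(1, -2), c(0, 0), c(3, 4))
sglPenalty(beta, alpha = 1)  ## plain lasso: sum(abs(beta)) = 10
sglPenalty(beta, alpha = 0)  ## group lasso: sqrt(2) * (sqrt(5) + 0 + 5)
```

Note how the zero row contributes nothing to the group term, which is what drives group-wise selection.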
lambda <- msgl.lambda.seq(Xprim, Yprim, lambda.min = 0.01)
miRNAmsgl <- msgl.cv(Xprim, Yprim, lambda = lambda)
lambda <- msgl.lambda.seq(Xprim, Yprim, alpha = 0,
lambda.min = 0.01)
miRNAmsgl <- msgl.cv(Xprim, Yprim, lambda = lambda, alpha = 0)
The R package msgl implements sparse group lasso for multinomial models.
It supports the default grouping of all parameters associated with one predictor, and it supports user defined groups of predictors.
The glmnet package supports group lasso for multinomial models with the default grouping, but not sparse group lasso or grouping of predictors.
miRNAlasso <- cv.glmnet(Xprim, Yprim, family = "multinomial",
type.measure = "class",
type.multinomial = "grouped")
plot(miRNAlasso, ylim = c(0, 0.4))
A professional cowboy recognizes the layman by his usage of the word lasso.
To the cowboy it is simply the rope.
My next cool statistical method has to be called Rope.
There is R in Rope, and, hopefully, the professional R users will recognize the layman by his usage of lasso - once we have Rope!
Thanks!
The packages used explicitly in the presentation:
library("glmnet") ## Lasso via coordinate descent
library("lars") ## Lasso via the lar algorithm
library("msgl") ## Multinomial sparse group lasso
library("reshape2") ## For 'melt'
library("ggplot2") ## Plotting
library("lattice") ## Plotting
library("rgl") ## 3d
library("misc3d") ## More 3d
Other packages worth mentioning:
library("LiblineaR") ## Fast logistic lasso etc.
library("grpreg") ## Group lasso etc.
library("glamlasso") ## Lasso for array models